An 8-bit carry look-ahead adder with 150 ps latency and 
sub-microwatt power dissipation at 10 GHz

Reciprocal Quantum Logic combines the speed and power-efficiency of single-flux quantum superconductor
devices with design features that are similar to CMOS. We have demonstrated an
8-bit carry look-ahead adder in the technology using combinational gates with fanout of four and 
non-local interconnect. Measured power dissipation of the fully active circuit is only 510 nW at 
6.2 GHz. Latency is only 150 ps at a clock rate of 10 GHz. 



I. INTRODUCTION 

Superconducting Reciprocal Quantum Logic (RQL) is
an ultra-low-power technology for high-performance
computing that gives unmatched efficiency in terms of
operations per joule. Simple experiments have shown bit
energy approaching 1000 $k_B T$, with further reduction
expected using smaller devices. This means that the
technology offers two orders of magnitude power savings over
22 nm CMOS even after taking into account the overhead
of the cryocooler, of order 1000 W/W at an operating
temperature of 4.2 K. RQL introduces a combination
of new features that are unattainable by other
superconducting logic families, including zero static power
dissipation, stable timing, low bit-error rate, low
active-device count, and low latency. The logic is combinational
with multiple levels of logic per pipeline stage, following
conventional CMOS behavior. This allows a wealth of
CMOS design tools and methods to be applied. A
small-current AC waveform provides power and a stable clock
reference to each gate, which in principle allows the
technology to scale to VLSI.

Here we report the design, fabrication, and test of an
RQL 8-bit carry look-ahead (CLA) adder as a benchmark
to demonstrate the effectiveness of the logic in a
larger circuit. The CLA adder is an important hardware
component for low-latency parallel addition. However,
implementation of this circuit has been a major challenge
for superconducting technology in the past.
Gate-level pipelining and inefficient clock distribution resulted
in designs with high latency that defeated the purpose.
Asynchronous or self-timed designs can reduce latency [3],
but high active device count, high current
bias, and timing uncertainty limit increased integration
scale. The RQL implementation of the adder achieves a
tenfold improvement in latency and Josephson-junction
device count over earlier designs and shows that the main
metric, power efficiency, is scalable to a large circuit as
predicted by the first simple tests.

FIG. 1. Block diagram of the 8-bit Kogge-Stone CLA adder
circuit and schematic of the individual CLA bit.



II. LOGICAL DESIGN 

The adder, shown schematically in Fig. 1, computes
the carry bits with minimal latency at the expense of
four-way fan-out and high-density, long interconnects.
The design is a Kogge-Stone radix-2 implementation that
for N-bit inputs consists of $\log_2 N + 2$ stages. The first
stage (A/OR) produces carry propagate and carry generate
signals. The following stages form the carry look-ahead
(CLA) network that computes all carries in parallel.
The final result is computed in the last stage where
carries are summed with the input sums using logical
XOR.

The A/OR stage produces logical AND and OR for
input bits $A_i$ and $B_i$, expressed as $G_i = A_i B_i$ and
$P_i = A_i + B_i$. These outputs are the generate and propagate
signals, which have fanout of up to four to reach
multiple CLA blocks in the second stage. Each stage of
the CLA network takes select inputs from the previous
stage, $(G_i, P_i)$ and $(G_j, P_j)$, and computes
$(G_{out,i}, P_{out,i})$, expressed as $G_{out,i} = P_i G_j + G_i$
and $P_{out,i} = P_i P_j$. For the $n$th stage, input selection
obeys $j = i - 2^n$. Inputs and outputs of the CLA blocks are
pruned to stay within boundaries and to produce only the
culminating generate bits, which correspond to carries $C_i$.
The final sum is computed as $S_i = A_i \oplus B_i \oplus C_i$.
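
The recurrences above can be checked with a small bit-level model. The following is a minimal behavioral sketch in Python, not the authors' VHDL; the function name `kogge_stone_add` is hypothetical, and it assumes that the carry into bit i is the culminating generate bit of position i - 1, with zero carry into bit 0.

```python
def kogge_stone_add(a: int, b: int, n_bits: int = 8) -> int:
    """Add two unsigned n_bits-wide integers with a radix-2 Kogge-Stone network.

    Returns an (n_bits + 1)-bit result; the top bit is the carry out.
    """
    a_bits = [(a >> i) & 1 for i in range(n_bits)]
    b_bits = [(b >> i) & 1 for i in range(n_bits)]

    # First stage: G_i = A_i AND B_i, P_i = A_i OR B_i.
    g = [x & y for x, y in zip(a_bits, b_bits)]
    p = [x | y for x, y in zip(a_bits, b_bits)]

    # CLA network: stage n combines position i with position j = i - 2**n.
    span = 1
    while span < n_bits:
        g_next, p_next = g[:], p[:]
        for i in range(span, n_bits):  # positions with j < 0 pass through
            j = i - span
            g_next[i] = g[i] | (p[i] & g[j])
            p_next[i] = p[i] & p[j]
        g, p = g_next, p_next
        span *= 2

    # After the network, g[i] is the carry out of bit i.
    # Assumption: the carry into bit i is g[i - 1]; bit 0 has no carry in.
    carry_in = [0] + g[:-1]
    total = 0
    for i in range(n_bits):
        total |= (a_bits[i] ^ b_bits[i] ^ carry_in[i]) << i
    return total | (g[-1] << n_bits)
```

For 8-bit inputs the loop runs three times (spans 1, 2, and 4), which together with the first stage and the final XOR stage gives the five stages quoted in the text.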

RQL allows efficient design of the adder in terms of 
device count and latency. Unlike other superconductor 
logic, RQL gates are combinational with behavioral
descriptions similar to CMOS. As in CMOS circuits, this
allows use of multiple levels of logic per pipeline stage,
which greatly reduces latency. RQL logic has an additional
efficiency in wave pipelining that eliminates active
latches to synchronize signals at each stage. Instead,
four-phase power is used to move signals from stage to 
stage on the rising edge of the power waveform, which 
also serves as the clock. Data propagation speed finds 
equilibrium independently of initial conditions, clock am- 
plitude, parameter variations, and data pattern. For 
the 8-bit adder, 8 parallel clock lines provide power to 
each bit. There is no skew between clock lines provided 
that their geometrical lengths are equal. Each of the five 
stages of the adder takes one phase of the clock, which 
amounts to a total of 1.25 clock cycles for the computa- 
tion to complete. 

RQL gates have no static power dissipation. Power is
only dissipated for logical "ones", physically encoded as
a pair of positive and negative (reciprocal)
single-flux-quantum (SFQ) pulses. The dynamic power dissipation
is $P = 0.33\, I_c \Phi_0 N f$, where $I_c$ is the average critical
current of the Josephson junctions, $N$ is the number of
junctions, $\Phi_0 = h/2e = 2.068$ mV·ps is the flux quantum,
$f$ is the operating frequency, and the prefactor of 0.33
is experimentally determined [1]. AC loss in the
superconducting line is small. The AC power is applied
through weak inductive coupling, which leads to important
advantages: the gates are powered in series requiring
small current, and dynamics in the circuit do not affect
timing. An RQL circuit scaled to 2 million Josephson
devices of 0.1 mA average critical current would dissipate
only 1.4 mW of power when fully active. Applied power
would need to be larger than dissipated power in order to
maintain clock stability in terms of amplitude and timing.
Since operating margins of the gates are sufficient
to tolerate ±10% variation in clock current, only 4 mW
of clock power would be needed to power the
2-million-device chip, amounting to only 9 mA of current on a 50 Ω
line. Timing variation would amount to only 5 ps, or
±2% of the clock period at 10 GHz.
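
The scaling estimate in this paragraph is direct arithmetic on the dissipation formula. The sketch below reproduces it under two assumptions that the paragraph does not state explicitly: the scaled chip is clocked at 10 GHz, and the quoted clock current is an rms value on a matched 50 Ω line. The function names are illustrative only.

```python
import math

PHI_0 = 2.068e-15  # flux quantum h/2e in Wb (equivalently 2.068 mV*ps)


def rql_dynamic_power(i_c, n_junctions, f_clk, prefactor=0.33):
    """Dynamic power P = prefactor * I_c * Phi_0 * N * f, in watts."""
    return prefactor * i_c * PHI_0 * n_junctions * f_clk


def rms_current(power, impedance):
    """RMS current that delivers `power` into a resistive `impedance`."""
    return math.sqrt(power / impedance)


# 2 million junctions, 0.1 mA average critical current, 10 GHz (assumed).
p_chip = rql_dynamic_power(i_c=0.1e-3, n_junctions=2_000_000, f_clk=10e9)

# 4 mW of applied clock power on a 50 ohm line.
i_clk = rms_current(power=4e-3, impedance=50.0)

print(f"dissipated power: {p_chip * 1e3:.2f} mW")
print(f"clock current:    {i_clk * 1e3:.1f} mA rms")
```

Under these assumptions the dissipation comes out near 1.4 mW and the clock current near 9 mA, in line with the figures in the text.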

The adder was designed using gate-level VHDL models.
Only two basic RQL logic gates were used: AndOr
and AnotB [1]. The AndOr gate propagates the first
logical "one" input in a given clock cycle to the Or output,
and the second "one" input to the And output. The
AnotB gate propagates a logical "one" on the A input to the
output unless a logical "one" on the B input comes in the
same clock cycle. Logical XOR can be expressed as "A
or B, but not both A and B," which is implemented by
connecting the outputs of the AndOr gate to the inputs
of the AnotB gate. The CLA block, shown in Fig. 1, contains
And and Or gates. These are implemented by pruning
the outputs of the AndOr gate. RQL gates require
active interconnect, consisting of two sequential Josephson
junctions, for signal amplification. The same active
interconnect is used both in splitters to produce fanout, and
to produce a delay element. The five less-significant
carries use delay cells to reach the last column. The AnotB
gates in the second column, which produce the partial sum,
are in parallel with the CLA network and also use delay cells.
The three more-significant carries set the overall latency
of the circuit, and achieve up to eight levels of logic for
a pipeline stage consisting of all four clock phases.

FIG. 2. Simulated operating margins on clock power supplied
to the adder, as a function of clock rate.
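
The composition of XOR from the two gate types can be illustrated with a purely logical model. This sketch treats each gate as a Boolean function of its inputs within one clock cycle and ignores pulse timing and the reciprocal encoding; the function names are hypothetical stand-ins for the AndOr and AnotB gates.

```python
def and_or(a: int, b: int) -> tuple[int, int]:
    """AndOr gate: returns (And output, Or output) for one clock cycle."""
    return a & b, a | b


def a_not_b(a: int, b: int) -> int:
    """AnotB gate: passes A unless B is also asserted in the same cycle."""
    return a & (1 - b)


def xor(a: int, b: int) -> int:
    """'A or B, but not both': Or output feeds A, And output feeds B."""
    and_out, or_out = and_or(a, b)
    return a_not_b(or_out, and_out)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor(a, b))
```

Pruning one output of `and_or` gives the plain And and Or gates used inside the CLA block.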

The adder design was verified using WRSpice, a
physical-level circuit simulator that includes a device
model for the Josephson junction. Simulated operating
margins on clock power as a function of frequency for
the entire eight-bit adder are shown in Fig. 2. The upper
limit corresponds to over-bias and is relatively frequency
independent. The lower limit narrows the operating region
with frequency, indicating that the circuit is
latency-limited at high clock rate. The Josephson junction model
used was for 1.5 µm devices with a critical current density
of 4.5 kA/cm² and an $I_c R$ product of 0.7 mV, where
$I_c$ is the critical current, typically 141-200 µA in our
circuit, and $R$ is the external shunt resistance. Such devices
produce voltage pulses with an area equal to the single-flux
quantum, about 0.7 mV high and 3 ps wide. Device delay for
sequentially wired junctions is also about 3 ps under
nominal clock power. A clock rate of 10 GHz was the design
target as there are up to eight sequential junctions per
phase and four phases per clock cycle. In the circuit
simulation, this clock rate gives a wide clock-power operating
margin of 4.6 dB.

FIG. 3. Picture of the fabricated 8-bit CLA on a 5 mm die,
using a 2 µm superconducting Nb process.

FIG. 4. The single-channel input and eight output waveforms
from the adder as captured on a sampling oscilloscope at a
clock rate of 9.8 GHz. The patterns are 16 bits long, and
the addends consist of cyclic permutations of the serial input.
One addend corresponds to 8 serial input bits, and the other
addend corresponds to the next 8 bits in reverse order.
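
The 10 GHz design target quoted in Section II follows from a simple delay budget. The sketch below works through it, assuming that every junction in the longest path contributes the nominal 3 ps delay and that the SFQ pulse can be approximated as a rectangle; the variable names are illustrative.

```python
PHI_0_MV_PS = 2.068          # flux quantum in mV*ps

junction_delay_ps = 3.0      # per sequential junction at nominal clock power
junctions_per_phase = 8      # worst-case sequential junctions in one phase
phases_per_cycle = 4

# Shortest clock period that still fits the worst-case path in every phase.
min_period_ps = junction_delay_ps * junctions_per_phase * phases_per_cycle
max_clock_ghz = 1e3 / min_period_ps

# Rectangular estimate of the pulse area: 0.7 mV high, 3 ps wide.
pulse_area_mv_ps = 0.7 * 3.0

print(f"minimum period: {min_period_ps:.0f} ps")
print(f"maximum clock:  {max_clock_ghz:.1f} GHz")
print(f"pulse area:     {pulse_area_mv_ps:.1f} mV*ps (Phi_0 = {PHI_0_MV_PS} mV*ps)")
```

The budget gives a period of 96 ps, slightly above 10 GHz, and the rectangular pulse area is within a few percent of the flux quantum.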



III. PHYSICAL DESIGN 

The 5 mm × 5 mm CLA chip is shown in Fig. 3. The
circuit includes four passive, microwave Wilkinson power
splitters for clock distribution and eight distributed
amplifiers [11] to output the result to room temperature.
A 16-stage shift register with a single serial input is used
to generate the two input words. Taps on the shift register
feed the CLA inputs: starting with the LSB of input
word A and working up, then wrapping around to the
MSB of word B and working back down. In this way an
arbitrary combination of input words can be applied every
16 clock cycles, with the same bits reused in different
combinations in the intermediate clock cycles.
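
The tap arrangement can be made concrete with a short model. This is a sketch under stated assumptions: tap k is taken to be element k of a 16-element list, taps 0-7 drive A0..A7, taps 8-15 drive B7..B0, and the repeating 16-bit pattern is modeled as a list rotation per clock cycle. The actual tap numbering and shift direction on the chip are not given in the text, and the function names are hypothetical.

```python
def tap_words(register):
    """Map 16 shift-register taps to the two 8-bit input words (A, B)."""
    assert len(register) == 16
    # Taps 0..7: LSB of A working up to the MSB of A.
    a = sum(bit << i for i, bit in enumerate(register[:8]))
    # Taps 8..15: MSB of B working back down to the LSB of B.
    b = sum(bit << i for i, bit in enumerate(reversed(register[8:])))
    return a, b


def cyclic_inputs(pattern):
    """Yield (A, B) for each of the 16 rotations of a repeating pattern."""
    for shift in range(16):
        rotated = pattern[shift:] + pattern[:shift]
        yield tap_words(rotated)


pattern = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0]
for a, b in cyclic_inputs(pattern):
    print(f"A = {a:3d}, B = {b:3d}, A + B = {a + b:3d}")
```

Each 16-bit pattern therefore exercises the adder with 16 different operand pairs, one per clock cycle, before the sequence repeats.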

The circuit was designed for a Nb superconductor
foundry service that is more than adequate to yield
the circuit. However, process integration scale is quite
modest by CMOS standards, with only four metal layers
and 4 µm wire pitch, which limits circuit density. Two
metal layers were used either for double ground planes
surrounding the logic gates or for clock stripline,
leaving only the remaining two metal layers for gates and
interconnect.

To accommodate physical layout, an additional idle
phase consisting only of delay cells was added before the
last CLA column to allow the three longest active
interconnects to cover distance. This increased the total
number of clock phases to six, which amounts to 1.5 clock
cycles or 150 ps latency at a 10 GHz clock rate. In the
final design the three long active interconnects were
replaced by 5 Ω passive superconducting striplines, which
propagate signals at the 100 µm/ps speed of light in the line.
The signals are received by active interconnect circuits with the
timing constraint that the receiver must be near the peak
of the clock waveform to receive the positive pulse.
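
The latency figures follow from the number of clock phases the data traverses. A minimal sketch, assuming each stage occupies exactly one of four equally spaced phases per clock cycle (the helper name `latency_ps` is hypothetical):

```python
import math


def latency_ps(n_phases: int, f_clk_hz: float, phases_per_cycle: int = 4) -> float:
    """Latency in picoseconds for data crossing n_phases clock phases."""
    cycles = n_phases / phases_per_cycle
    return cycles * 1e12 / f_clk_hz


n_bits = 8
logic_stages = int(math.log2(n_bits)) + 2       # five stages for 8 bits

logical_latency = latency_ps(logic_stages, 10e9)       # logical design
physical_latency = latency_ps(logic_stages + 1, 10e9)  # plus one idle phase

print(f"{logic_stages} phases: {logical_latency:.0f} ps")
print(f"{logic_stages + 1} phases: {physical_latency:.0f} ps")
```

The idle phase added for layout costs a quarter of a clock cycle, raising the latency from 125 ps to 150 ps at 10 GHz.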

The clock distribution network is implemented as two
eight-way Wilkinson power splitters for the in-phase and
quadrature clock phases, and two identical Wilkinson
devices used in reverse to recombine the eight lines onto a
single line [13]. In this way the AC clock enters and
exits the chip without ever contacting chip ground. The
32 Ω impedance of the clock lines on chip is the largest
impedance achievable using the bottom metal layer, with
a 2 µm minimum line width, and using the top metal
layer as ground. The Wilkinson splitters were designed
for a center frequency of 7.5 GHz, and optimized to give
better than 30 dB return loss from 5-10 GHz, and
isolation better than 15 dB over the same range. A second
version of the design used splitters centered at 15 GHz.
A six-section design was chosen to transform the 50 Ω
impedance of the feed line to the effective 4 Ω impedance
of the eight 32 Ω clock lines in parallel. The circuit
design requires a nominal clock amplitude of 2 mA per line.
Total input on each of the two clock lines amounts to
only 3.2 mA rms.
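
The impedance and current figures in this paragraph are mutually consistent, as the following sketch shows. It assumes that the 2 mA clock amplitude is the peak value of a sinusoid on each 32 Ω line and that the splitter is lossless, neither of which is stated explicitly in the text.

```python
import math

n_lines = 8
z_line = 32.0      # ohm, on-chip clock line impedance
z_feed = 50.0      # ohm, feed line impedance
i_peak = 2e-3      # A, nominal clock amplitude per line (assumed peak value)

# Eight identical lines in parallel.
z_parallel = z_line / n_lines

# Average power of a sinusoid of peak current i_peak in one line.
p_line = 0.5 * i_peak**2 * z_line

# Lossless splitter: the feed line carries the sum of the line powers.
p_total = n_lines * p_line
i_feed_rms = math.sqrt(p_total / z_feed)

print(f"parallel impedance: {z_parallel:.0f} ohm")
print(f"total clock power:  {p_total * 1e6:.0f} uW")
print(f"feed current:       {i_feed_rms * 1e3:.1f} mA rms")
```

The total of about 0.5 mW per clock quadrature is of the same order as the applied clock power quoted later for the dissipation measurement.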



IV. CIRCUIT TEST 

Each chip is mounted in a pressure-contact cryoprobe
and cooled to 4.2 K in a liquid helium dewar. The
input is supplied from a digital pattern generator through
an inductive coupling to produce SFQ signals on-chip,
and returns to room temperature to be observed. The
pattern generator is phase-locked to a synthesizer that
generates the clock sinusoid, followed by a 90-degree
hybrid to split the clock into in-phase and quadrature. The
clock lines also return to room temperature after inductive
couplings on chip and are terminated in 50 Ω loads.
The eight output bits from the adder are converted from
SFQ signals to source-terminated voltages of 2 mV
peak-to-peak on-chip.

FIG. 5. Measured operating margins on clock power as a
function of clock rate, based on observation of the
most-significant output bit. Three chip types were tested that
differed in the design of the Wilkinson power splitter and
cryoprobe: 7.5 GHz Wilkinson and narrowband cryoprobe (Chip
1), 7.5 GHz Wilkinson and wideband cryoprobe (Chip 2), and
15 GHz Wilkinson and wideband cryoprobe (Chip 3).

The high-speed input and output waveforms indicating
correct digital operation of the circuit are shown in
Fig. 4. The input waveform is non-return-to-zero, while
the output waveforms are return-to-zero and are inverted
by low-noise amplifiers at room temperature. The visible
feedthrough of the clock sinusoids to the outputs is
frequency dependent and is attributed to pickup in the
cryoprobe package. Feedthrough from the -2 dBm clock
lines to the outputs is down by 45 dB, which is small in
absolute terms but is comparable to the -47 dBm output
levels. This effect could be eliminated in a future chip
design by using differential output.

In simulation, the frequency-dependent operating margins
on clock power were limited by signal propagation
time in the digital gates. In test, power margins can be
dominated by the microwave design of the circuit,
including the cryoprobe package. Different versions of the
chip were tested in two different cryoprobes, as shown in
Fig. 5. The narrowband probe had a two-layer printed
circuit board (PCB) for signal traces and ground. Chips
tested using this probe showed very strong frequency
dependence and were functional only in narrow frequency
bands in the range 5-10 GHz. The highest observed
frequency for correct operation was 9.8 GHz. The
wideband cryoprobe had a redesigned PCB and used a second
ground plane to minimize crosstalk from the clock
lines. The probe transition from coaxial line to stripline
on-chip was optimized to give a return loss of better than
20 dB from dc up to 20 GHz. Chip test using the
wideband probe produced a single large, continuous operating
region that extended from 4-9 GHz. Another test, using
the wideband probe and a chip with the Wilkinson power
network scaled up by a factor of two in frequency,
produced a small operating region observed around 11 GHz.

FIG. 6. S-parameters of the clock line for the 7.5 GHz
Wilkinson power-splitter chip and the high-bandwidth probe,
with the calibration plane at the probe head.

Digital operation of the chip (Chip 2 in Fig. 5) can
be compared to S-parameter measurements of the clock
distribution network of the same chip, shown in Fig. 6.
The sum of transmitted and reflected power measured 
on the clock line roughly corresponds to the round-trip 
attenuation of the cables in the probe, indicating small 
losses due to radiation and coupling to other lines across 
the entire frequency range up to 20 GHz. Transmitted 
power shows the bandwidth of the Wilkinson splitter to 
be about 4-10 GHz, and the center appears to be con- 
sistent with the 7.5 GHz design value. Small reflected 
power is a key figure of merit, as resonances create stand- 
ing waves, leading to non-uniform biasing of the circuitry 
and narrowed clock power operating margins. Within the 
bandwidth of the Wilkinson, reflected power is 10-22 dB 
less than transmitted power, with several resonances vis- 
ible. Qualitatively similar behavior was observed for the 
low bandwidth probe. However, there appears to be little 
correlation between power margin measurements of the 
digital circuit and these resonances. This indicates that 
S-parameter measurements alone cannot adequately pre- 
dict frequency-dependent digital operation of the chip. 
In any case, for any specific application the clock will 
be single-tone and thus narrow-band. This significantly 
relaxes microwave design constraints. 



V. POWER DISSIPATION MEASUREMENT 

A final test measured dynamic power dissipation of the 
circuit. This is done by observing the attenuation of the 
clock signals due to the power draw of device-switching 
events in the circuit. With an applied clock power of 
about 0.6 mW, the measurement must be sensitive to bet- 
ter than one part per thousand to be able to resolve the 
expected dissipation.

FIG. 7. Spectrum of the modulated clock output at 6.2 GHz,
measured on a spectrum analyzer, when the adder circuit was
fed with a repetitive data input pattern of 12k pseudo-random
bits followed by 12k zero bits. The measured power ratio of
the sidebands that are 259 kHz away from the fundamental is
shown.



To achieve this accuracy we use a modulation technique
where the input data pattern is periodically
chopped, alternating between all-zeros and a
pseudo-random pattern. Since none of the Josephson junction
devices in the circuit are active for an all-zeros input, 
this modulates the AC clock signal at the data chopping 
frequency, producing sidebands that are measurable with 
a spectrum analyzer. Dynamic power dissipation of the 
circuit is calculated from the power in the sidebands. 

Fig. 7 shows the spectra of the two clock lines returned
from the chip. The two clock quadratures I and Q had
respective power levels of -2.4 dBm and -2.0 dBm at
the chip, measured as the geometric mean of applied
and returned power to account for attenuation in the
probe cables. Sidebands are visible above and below the
carrier with a spacing of 259 kHz, which corresponds to
the fundamental of the modulating square wave. The
power ratio between a single sideband and the carrier,
$SSB = P_{ssb}/P_0$, was measured to be -69.3 dB for clock
Q and -79.3 dB for clock I. Additional sidebands
corresponding to higher odd harmonics of the square wave fall
outside the frequency range of our measurement.

The sidebands can arise either by amplitude modulation
(AM), or by phase modulation (PM). Both may
be present in the CLA adder circuit, but only AM
is indicative of power dissipation. The Josephson junction
circuit elements, which are inductively coupled to
the clock, are effectively in series with the clock line.
The non-switching junction can be modeled as an inductor,
the switching junction as a resistor. The switching
junctions take power from the clock line, producing AM,
and decrease the propagation speed of the clock, which
gives rise to PM. We will first establish an upper bound
on power dissipation assuming purely AM, and then
estimate dissipation accounting for the relative contributions
of AM and PM, using previously reported measurements
of an RQL circuit [1].

For the case of purely AM, the ratio of power dissipated
to power in the clock line is
$\Delta P/P_0 = (V_{hi}^2 - V_{lo}^2)/V_0^2$, where
$V_{hi} = V_0(1 + 2V_{sqr}/V_0)$ and $V_{lo} = V_0(1 - 2V_{sqr}/V_0)$ are the
maximum and minimum amplitude of the clock sinusoid.
$V_0$ is the amplitude of the clock carrier and $V_{sqr}$ is the
amplitude of the square wave modulation. The factors
of two account for the presence of double sidebands, one
above and one below the carrier frequency. The fundamental
of the square wave has an amplitude $4/\pi$ relative
to the amplitude of the square wave itself, so the
amplitude of the single sideband is $V_{ssb} = (4/\pi)V_{sqr}$. The
normalized power dissipation is

$$\frac{\Delta P}{P_0} = \frac{8V_{sqr}}{V_0} = 2\pi\,\frac{V_{ssb}}{V_0}. \qquad (1)$$

This is an upper bound on power dissipation on the chip.
It would also be possible to produce the power spectrum
measured in Fig. 7 by phase modulation, without any
power dissipation. Our previous measurement of clock
stability in an RQL shift register provides an estimate of
the relative contributions of AM and PM to the power
spectrum.
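
The algebra behind the purely-AM bound is short enough to write out. The following is a sketch that uses only the amplitude definitions given above; the last step, which expresses the sideband amplitude through the measured ratio as $V_{ssb}/V_0 = 10^{SSB/20}$ with SSB in dB, is our assumption about how the spectrum analyzer reading enters.

```latex
\begin{align*}
\frac{\Delta P}{P_0}
  &= \frac{V_{hi}^2 - V_{lo}^2}{V_0^2}
   = \left(1 + \frac{2V_{sqr}}{V_0}\right)^2 - \left(1 - \frac{2V_{sqr}}{V_0}\right)^2
   = \frac{8V_{sqr}}{V_0}, \\
V_{ssb} &= \frac{4}{\pi} V_{sqr}
  \quad\Rightarrow\quad
  \frac{\Delta P}{P_0} = 2\pi\,\frac{V_{ssb}}{V_0} = 2\pi \times 10^{SSB/20}.
\end{align*}
```

With the measured SSB of -69.3 dB on clock Q, this bound evaluates to roughly $2 \times 10^{-3}$ of the carrier power.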

In order to estimate the AM and PM contributions
to measured sideband power we calculate the factors $m_a$
and $m_p$ corresponding to the AM and PM amplitudes for
a simple sine-modulated waveform

$$V = V_0\,[1 + m_a \sin(\omega_m t)]\,\sin[\omega_c t + m_p \sin(\omega_m t)], \qquad (2)$$

where $V_0$ and $\omega_c$ are the amplitude and frequency of
the carrier, and $\omega_m$ is the frequency of the modulation.
From this equation $m_a$ can be found by solving
$P_{lo}/P_{hi} = V_{lo}^2/V_{hi}^2 = (1 - 2m_a)/(1 + 2m_a)$, as the
waveform amplitude ranges over $V_0(1 \pm m_a)$. Similarly, $m_p$
can be found by solving $\Delta t = t_{hi} - t_{lo} = 2m_p/\omega_c$, as
the waveform phase ranges over $\omega_c t \pm m_p$. Both the
data-modulated power ratio, $P_{lo}/P_{hi} = 0.91$, and a
data-modulated time delay, $\Delta t = 1.4$ ps, were previously
reported in [1] for an RQL shift register operating at 6 GHz.
These measurements give $m_a = 0.023$ and $m_p = 0.026$,
indicating that AM and PM are about equal at the clock
rate of interest.
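
The two modulation indices can be recomputed from the quoted shift-register measurements. The sketch below solves the two relations in the paragraph above for $m_a$ and $m_p$; it uses the rounded inputs 0.91 and 1.4 ps, so the results agree with the quoted values only to within that rounding. The function names are illustrative.

```python
import math


def am_index(power_ratio: float) -> float:
    """Solve P_lo/P_hi = (1 - 2*m_a) / (1 + 2*m_a) for m_a."""
    return (1.0 - power_ratio) / (2.0 * (1.0 + power_ratio))


def pm_index(delta_t_s: float, f_clk_hz: float) -> float:
    """Solve delta_t = 2*m_p / omega_c for m_p, with omega_c = 2*pi*f."""
    return 0.5 * delta_t_s * 2.0 * math.pi * f_clk_hz


m_a = am_index(0.91)            # data-modulated power ratio
m_p = pm_index(1.4e-12, 6e9)    # data-modulated delay at 6 GHz

print(f"m_a = {m_a:.4f}")
print(f"m_p = {m_p:.4f}")
```

Both indices come out close to 0.025, which supports treating the AM and PM contributions as equal.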

The amplitudes of the AM and PM sidebands add in
quadrature, so for equal AM and PM, only half of the
power in the observed sideband is attributable to AM.
Including a correction factor of 1/2 (or equivalently, -3 dB)
to the AM sideband power of equation (1), the normalized
power dissipation is

$$\frac{\Delta P}{P_0} = 2\pi \times 10^{(SSB - 3)/20}, \qquad (3)$$

where SSB is the power ratio of a single sideband to the
carrier expressed in dB. The correction factor reduces
the estimate of power dissipation compared to the upper
bound set by purely AM by 30%.

Using the measured values for SSB in equation (3), we
calculate that the active power dissipation was 970 nW
in clock Q and 280 nW in clock I. The difference in power
dissipation between the two clock quadratures is due to
the output amplifiers, which are powered exclusively on
clock Q and make up 50% of the junction critical current
on the chip. Using the measured values for $P_0$ on the
two clock lines, total power dissipation on chip amounts
to 1.25 µW. Excluding the amplifiers and the input shift
register, the CLA core makes up 42% of total device
critical current, so power dissipation in the CLA amounts to
510 nW.
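
The conversion from measured sideband ratios to dissipated power can be reproduced numerically. The sketch below assumes that the normalized dissipation has the form $\Delta P/P_0 = 2\pi \times 10^{(SSB-3)/20}$, which follows from the amplitude definitions and the -3 dB correction described in the text; the helper names are hypothetical. Because the inputs are rounded, the results match the quoted figures only approximately.

```python
import math


def dbm_to_watts(dbm: float) -> float:
    """Convert a power level in dBm to watts."""
    return 1e-3 * 10 ** (dbm / 10)


def dissipated_power(ssb_db: float, p0_dbm: float, am_correction_db: float = 3.0) -> float:
    """Dissipated power in watts from a single-sideband ratio and carrier power.

    Assumed form: dP/P0 = 2*pi * 10**((SSB - correction) / 20).
    """
    ratio = 2 * math.pi * 10 ** ((ssb_db - am_correction_db) / 20)
    return ratio * dbm_to_watts(p0_dbm)


p_q = dissipated_power(ssb_db=-69.3, p0_dbm=-2.0)   # clock Q
p_i = dissipated_power(ssb_db=-79.3, p0_dbm=-2.4)   # clock I
p_total = p_q + p_i
p_cla = 0.42 * p_total                              # CLA share of critical current

for name, value in [("Q", p_q), ("I", p_i), ("total", p_total), ("CLA", p_cla)]:
    print(f"{name:>5}: {value * 1e9:.0f} nW")
```

Setting `am_correction_db=0` gives the purely-AM upper bound, about 40% higher than the corrected estimate.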

This result is in agreement with the previously
reported dynamic power dissipation $P = 0.33\, I_c \Phi_0 N f$ for a
simple RQL shift register. The CLA core, excluding the
input shift register and output amplifiers, has $N = 815$
junctions of average critical current $I_c = 162$ µA, so
power dissipation at $f = 6.21$ GHz is expected to be
560 nW. We note that this is equivalent to the static
power dissipation of a single bias resistor in the
incumbent superconductor logic family, RSFQ, which
typically supplies 200 µA from a 2.6 mV bus.
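
Both numbers in this comparison follow from one line of arithmetic each, written out here as a worked check using only the values quoted above (with $\Phi_0 = 2.068$ mV·ps from Section II):

```latex
\begin{align*}
P_{CLA} &= 0.33\, I_c \Phi_0 N f
  = 0.33 \times (162\ \mu\mathrm{A}) \times (2.068\ \mathrm{mV\,ps})
    \times 815 \times (6.21\ \mathrm{GHz})
  \approx 560\ \mathrm{nW}, \\
P_{bias} &= I V = (200\ \mu\mathrm{A}) \times (2.6\ \mathrm{mV}) = 520\ \mathrm{nW}.
\end{align*}
```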

VI. CONCLUSION 

We report an 8-bit carry look-ahead adder that
advances reciprocal quantum logic from the first benchmark